Training Parameters Tab and Settings
The Training Parameters tab, shown below, includes a set of basic settings for training a deep model, as well as advanced settings that let you modify the defaults of the selected optimization algorithm and add metric and callback functions.
Training Parameters tab
A. Basic settings B. Advanced settings
The basic settings that you need to set to train a deep model are available in the top section of the Training Parameters tab, as shown below.
Basic settings
| Setting | Description |
|---|---|
| Input (Patch) Size |
During training, the training data is split into smaller 2D data patches whose size is defined by the 'Input (Patch) Size' parameter.
For example, if you choose an Input (Patch) Size of 64, the Deep Learning Tool will cut the dataset into sub-sections of 64x64 pixels. These sub-sections are then used as the training dataset. By subdividing images, each pass, or 'epoch', should be faster and use less memory (see the sketch that follows this table). |
| Stride to Input Ratio |
The 'Stride to Input Ratio' specifies the overlap between adjacent patches.
At a value of '1.0', there is no overlap between patches and they are extracted edge to edge. At a value of '0.5', adjacent patches overlap by 50%. You should note that any value greater than '1.0' will result in gaps between data patches. |
| Epochs Number | A single pass over all the data patches is called an epoch, and the number of epochs is controlled by the 'Epochs Number' parameter. |
| Batch Size | Patches are randomly processed in batches and the 'Batch Size' parameter determines the number of patches in a batch. |
| Loss Function |
Loss functions, which are selectable in the drop-down menu, measure the error between the neural network's prediction and the ground truth. The error is then used to update the model parameters.
You should note that not all loss functions work well with all models; the available selections are automatically filtered according to the model type — Regression (for super-resolution and denoising) or Semantic Segmentation (for binary and multi-class segmentations). Regression loss functions… Are used for regression problems, that is, when the target variable is continuous. One of the most widely used regression loss functions is Mean Squared Error. Other loss functions you might consider are Cosine Similarity, Huber, Mean Absolute Error, Poisson, and others listed in the drop-down menu (see Loss Functions for Regression Models). Semantic segmentation loss functions… Are used for segmentation problems, that is, when the target output is a multi-ROI. When training a multi-class segmentation model, 'CategoricalCrossentropy' is generally a good choice because a classification must be made for each pixel. See Loss Functions for Semantic Segmentation Models for additional information about the available loss functions. Note Go to www.tensorflow.org/api_docs/python/tf/keras/losses for additional information about loss functions. |
| Optimization Algorithm |
Optimization algorithms are used to update the parameters of the model so that prediction errors are minimized. Optimization is a procedure in which the gradient — the partial derivative of the loss function with respect to the network's parameters — is first computed and then the model weights are modified by a given step size in the direction opposite of the gradient until a local minimum is achieved.
Dragonfly's Deep Learning Tool provides several optimization algorithms — Adagrad, Adam, RMSprop, SGD (Stochastic Gradient Descent), and many others — which work well on different kinds of problems. Adam is generally a good starting point. The default settings can be modified in the Advanced Settings (see Optimization Algorithm Parameters). Note You can find more information about optimization algorithms at www.tensorflow.org/api_docs/python/tf/keras/optimizers. You can also refer to the publication Demystifying Optimizations for Machine Learning (towardsdatascience.com/demystifying-optimizations-for-machine-learning-c6c6405d3eea). |
| Estimated Memory Ratio |
Displays the estimated memory ratio, which compares the estimated memory needed to train the model at the current settings with your system's available memory. You should note that the total memory required to train a model depends on the implementation and the selected optimizer. In some cases, the size of the network may be bound by your system's available memory.
Green … The estimated memory requirements are within your system's capabilities. Yellow … The estimated memory requirements are approaching your system's capabilities. Red … The estimated memory requirements exceed your system's capabilities. You should consider adjusting the model training parameters or selecting a shallower model. Note Memory is one of the biggest challenges in training deep neural networks. Memory is required to store input data, weight parameters and activations as an input propagates through the network. In training, activations from a forward pass must be retained until they can be used to calculate the error gradients in the backwards pass. As an example, the 50-layer ResNet network has about 26 million weight parameters and computes close to 16 million activations in the forward pass. If you use a 32-bit floating-point value to store each weight and activation this would give a total storage requirement of 168 MB. By using a lower precision value to store these weights and activations you could halve or even quarter this storage requirement. Note Refer to imatge-upc.github.io/telecombcn-2016-dlcv/slides/D2L1-memory.pdf for information about calculating memory requirements. |
| Show Advanced Settings | If selected, lets you access the Advanced Settings panel (see Advanced Settings). |
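As a rough illustration of how the 'Input (Patch) Size' and 'Stride to Input Ratio' settings interact, the following sketch computes the top-left corners of the 2D patches that would be cut from a single slice. It is a simplified assumption for illustration only, not the Deep Learning Tool's actual patch extractor.

```python
def patch_origins(height, width, patch_size=64, stride_to_input_ratio=1.0):
    """Return the (row, col) top-left corners of the 2D patches.

    A stride-to-input ratio of 1.0 places patches edge to edge,
    0.5 gives a 50% overlap, and values above 1.0 leave gaps.
    """
    stride = max(1, int(round(patch_size * stride_to_input_ratio)))
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]

# A 512x512 slice cut into 64x64 patches with no overlap yields an 8x8 grid.
origins = patch_origins(512, 512, patch_size=64, stride_to_input_ratio=1.0)
print(len(origins))  # 64 patches; one epoch is a single pass over all of them
```

With a Batch Size of 32, for example, those 64 patches would be processed in two batches per epoch.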
The following loss functions are available for regression models.
| Loss Function | Description |
|---|---|
| Huber |
Similar to mean squared error, Huber loss is linear when |input - output| > 1 and quadratic when |input - output| < 1. This allows the Huber loss to be strongly convex in a uniform neighborhood of |input - output| = 0. Huber loss has the benefit of reducing the effect of outliers while remaining differentiable everywhere. |
| LogCosh |
Similar to Huber loss, this loss function behaves quadratically when |input - output| < 1 and quasi-linearly elsewhere. |
| MeanAbsoluteError (MAE) |
Mean absolute error, which is sometimes called the 'L1 norm', is a measure of the difference between paired inputs and outputs. The error grows linearly the further apart the pairs are. |
| MeanAbsolutePercentageError |
Computes the mean absolute percentage error between y_true and y_pred.
|
| MeanSquaredError (MSE) |
Similar to mean absolute error, mean squared error (sometimes called the 'L2 norm') measures the squared distance between input and output pairs. The error grows quadratically the further apart the pairs are. |
| MeanSquaredLogarithmicError |
Similar to mean squared error, mean squared logarithmic error is equivalent to MSE on log(input + 1) and log(output + 1).
|
| Poisson |
Uses the property of the Poisson distribution, in which the mean and variance are equal. Poisson loss therefore includes variance as well as mean information from the ground truth labels when calculating the loss. Poisson loss heavily punishes the model for assigning a zero pixel intensity when there is a small intensity in the target, and it favors models that overshoot the mean in their regression rather than undershoot it. |
| OrsGradientLoss |
A simple custom loss function that helps preserve image structures for super-resolution. It is the squared error between the X and Y gradients of the true and predicted batches, normalized by twice the batch size to account for the two gradient directions (a sketch of a comparable gradient-based loss follows this table). |
| OrsMixedGradientLoss |
This custom loss function is similar to OrsGradientLoss, but it is more forgiving of imperfect dataset registration. It includes mean squared error in the loss calculation, and Sobel edges are used instead of the gradient. |
| OrsPerceptualLoss |
Designed for use with GAN-style models, this custom loss function combines style loss (4 activation layers of VGG), feature loss (1 activation layer of VGG), total variation (TV) loss, and MSE loss. This loss function tends to preserve structures well, with less blur in output images. |
| OrsPsnr |
PSNR (peak signal-to-noise ratio) computes the ratio between the power of a signal and the power of the noise in that signal. It is calculated on a log scale as the square of the maximum possible intensity divided by the mean squared error. Higher intensities in the image produce higher mean squared errors, so models are penalized less for inputs with a high dynamic range. |
| OrsTotalVarianceLoss |
Total variation is defined as the integral of the absolute gradient of the image. This custom loss function tries to minimize the variation in an image while maintaining a fit to the original; essentially, it tries to reproduce the image with piecewise-constant functions. |
| OrsVGGFeatureLoss |
This custom loss function compares the activation maps at a layer of a pretrained VGG16 model between the true and predicted batches. VGG-style networks have shown good discriminative performance and allow image features, rather than individual pixels, to be compared. |
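The 'Ors' loss functions listed above are custom to Dragonfly. As a hedged illustration only, the sketch below implements a generic gradient-preserving loss of the kind described for OrsGradientLoss (the squared error between the X and Y image gradients of the true and predicted batches) using standard TensorFlow operations; it is an approximation, not Dragonfly's actual implementation.

```python
import tensorflow as tf

def gradient_loss(y_true, y_pred):
    """Squared error between the X/Y image gradients of y_true and y_pred.

    Illustrative approximation of a gradient-preserving loss; not
    Dragonfly's OrsGradientLoss. Expects (batch, height, width, channels).
    """
    dy_true, dx_true = tf.image.image_gradients(y_true)
    dy_pred, dx_pred = tf.image.image_gradients(y_pred)
    # Normalize by twice the batch size to account for the two gradient directions.
    batch = tf.cast(tf.shape(y_true)[0], y_true.dtype)
    return (tf.reduce_sum(tf.square(dy_true - dy_pred)) +
            tf.reduce_sum(tf.square(dx_true - dx_pred))) / (2.0 * batch)
```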
The following loss functions are available for semantic segmentation models.
| Loss Function | Description |
|---|---|
| CategoricalCrossentropy |
The most common loss for classification. The model outputs an estimate of the likelihood that a voxel belongs to each class — '0' meaning the model does not believe the class is present, '1' meaning it strongly believes the class is present, and '0.5' meaning it cannot decide whether the class is present or not.
The loss sums the negative log of the predicted probabilities of the true labels. When the model does not believe a voxel belongs to its true class (the true label is 1 but the predicted probability is close to 0), the error grows. References [2] https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy |
| CategoricalHinge |
Linear loss calculated from a decision boundary, which is a weighted midpoint between two classes.
For example, the decision boundary would lie at 0.75 in a system with two output classes, 0 and 1, where class 1 is three times more likely than class 0. For a model output of 0.6, class 1 would be predicted. However, there would still be a loss of 0.15 despite finding the true class, because the model was not sure enough about an over-represented class. |
| CosineSimilarity |
The performance of the model is evaluated by finding the cosine of the angle between predicted and true labels, which are represented as vectors.
For example, classes A, B, and C are represented as A = [1 0 0], B = [0 1 0], and C = [0 0 1]. For a model output of [0.9 0.1 0.1], the cosine similarity between the output and class A would be approximately 0.99. |
| KLDivergence |
Measures how one probability distribution 'P' differs from a second, reference distribution 'Q'. Originally used to compare two models, the KL divergence measures the information gain achieved by using P over Q. In Dragonfly, Q represents the ground truth labels and is perfectly known, so the KL divergence can be interpreted as the information loss between the model and the ground truth distribution. The closer the distribution of the model is to the ground truth, the better. |
| OrsDiceLoss* |
Similar to F1 score, evaluates the model's ability to maximize precision (how many positives are actually positive) and recall (how many positives were detected).
|
| OrsJaccardDistance* |
Measures the ratio of the intersection of two sets to the union of those two sets. The closer that ratio is to 1, the better the model is. Similar to Dice loss, but it weights false negatives and false positives differently, while Dice loss weights them the same. |
* The 'OrsDiceLoss' and 'OrsJaccardDistance' loss functions are often used when segmentation classes are unbalanced, as they give all classes equal weight. However, training with these loss functions may be less stable than with others. Refer to Salehi et al., Tversky loss function for image segmentation using 3D fully convolutional deep networks, Cornell University, 2017-06-17 (arxiv.org/pdf/1706.05721.pdf) for information about the implementation of these loss functions.
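For illustration, the following is a minimal soft Dice loss of the kind described above for 'OrsDiceLoss', written for one-hot labels and softmax predictions. It is a generic formulation under those assumptions; Dragonfly's implementation may differ in weighting and smoothing details.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss for batches shaped (batch, height, width, classes).

    Generic sketch only; Dragonfly's OrsDiceLoss may differ in detail.
    """
    axes = [1, 2, 3]  # sum over the spatial and class axes of each sample
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    union = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(dice)
```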
The following optimization algorithms are available for deep models. You should note that you can fine-tune the hyperparameters of the selected optimization algorithm to further enhance model accuracy (see Optimization Algorithm Parameters).
| Algorithm | Description |
|---|---|
| Adadelta |
Optimizer that implements the Adadelta algorithm. Adadelta optimization is a stochastic gradient descent method that is based on adaptive learning rate per dimension to address two drawbacks — the continual decay of learning rates throughout training, and the need for a manually selected global learning rate.
|
| Adagrad |
Optimizer that implements the Adagrad algorithm. Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. The more updates a parameter receives, the smaller the updates.
|
| Adam |
Optimizer that implements the Adam algorithm, a stochastic gradient descent method based on adaptive estimates of first-order and second-order moments. Adam is generally a good starting point. |
| Adamax |
Optimizer that implements the Adamax algorithm, a variant of Adam based on the infinity norm. Adamax is sometimes superior to Adam, especially in models with embeddings. |
| Nadam |
Optimizer that implements the Nadam algorithm, which is Adam with Nesterov momentum.
|
| RMSprop |
Optimizer that implements the RMSprop algorithm, which maintains a moving average of the squared gradients and divides the gradient by the root of this average. |
| SGD |
Stochastic gradient descent optimizer, with optional momentum. |
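All of the algorithms above elaborate on the same basic rule described earlier: step the weights a small amount against the gradient of the loss. The following framework-free sketch of plain (non-momentum) stochastic gradient descent is for illustration only; the listed optimizers add adaptive learning rates, momentum, and other refinements on top of it.

```python
def sgd_step(weights, gradients, learning_rate=0.01):
    """One plain stochastic gradient descent update: w <- w - lr * grad."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Example: a single parameter with gradient 0.5 moves slightly downhill.
print(sgd_step([1.0], [0.5], learning_rate=0.1))  # [0.95]
```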
The advanced settings let you modify the default settings of the selected optimization algorithm and add metric and callback functions.
If required, you can fine-tune the hyperparameters of the selected optimization algorithm to further enhance model accuracy. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters, typically node weights, are learned.
Options to set the hyperparameters of the selected optimization algorithm are available in the Optimization Algorithm Parameters box, as shown below.
Default settings for the Adam optimization algorithm
| Setting | Description |
|---|---|
| Algorithm | Indicates the optimization algorithm selected for model training. |
| Parameters |
The parameters of the selected optimization algorithm appear here. You can find a description of each argument for the available algorithms as follows:
Adadelta… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adadelta#args. Adagrad… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad#args. Adam… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam#args. Adamax… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adamax#args. Nadam… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Nadam#args. RMSprop… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop#args. SGD… https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD#args. |
| Name |
Optional name prefix for the operations created when applying gradients. Defaults to the name of the selected optimization algorithm, for example, Adam.
Note This parameter is not available for the Adadelta and SGD optimization algorithms. |
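In Keras terms, fine-tuning these hyperparameters corresponds to constructing the optimizer with explicit arguments, as in the sketch below. The values shown are illustrative assumptions only; the argument names match the TensorFlow documentation linked above.

```python
from tensorflow import keras

# Illustrative values only; tune them to your data.
optimizer = keras.optimizers.Adam(
    learning_rate=1e-4,  # step size applied at each update
    beta_1=0.9,          # decay rate for the first-moment estimate
    beta_2=0.999,        # decay rate for the second-moment estimate
    epsilon=1e-7,        # numerical-stability constant
)
```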
Metrics are functions that measure the performance of your model and are computed when a model is evaluated. The metrics available for estimating a model's performance are selectable in the Metrics drop-down menu, as shown below.
Metrics
The following metrics are available for measuring the performance of regression models.
| Metric | Description |
|---|---|
| CosineSimilarity |
Computes the cosine similarity between the labels and predictions, in which: CosineSimilarity = (a · b) / (||a|| ||b||). |
| LogCoshError |
Computes the logarithm of the hyperbolic cosine of the prediction error, in which: LogCoshError = log((exp(x) + exp(-x))/2), where x is the error ('y_pred' - 'y_true'). |
| MeanAbsoluteError (MAE) |
Calculated by taking the mean of all absolute differences between the target and the prediction, this metric is less sensitive to outliers than mean squared error. |
| MeanAbsolutePercentageError |
Also known as mean absolute percentage deviation (MAPD), computes the mean absolute percentage error between 'y_true' and 'y_pred'. This metric is sensitive to relative errors.
|
| MeanSquaredError (MSE) |
Computes the mean squared error, a metric corresponding to the expected value of the squared (quadratic) error or loss. |
| MeanSquaredLogarithmicError |
Computes a metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss.
|
| Poisson |
Computes the Poisson metric between 'y_true' and 'y_pred', in which: Poisson = 'y_pred' - 'y_true' * log('y_pred'). |
| PrecisionAtRecall |
Computes the best precision where recall is greater than or equal to a specified value.
Note This metric creates four local variables, 'true_positives', 'true_negatives', 'false_positives', and 'false_negatives', that are used to compute the precision at the given recall. The threshold for the given recall value is computed and used to evaluate the corresponding precision. |
| RecallAtPrecision |
Computes the best recall where precision is greater than or equal to a specified value. For a given score-label distribution, the required precision might not be achievable; in this case, 0.0 is returned as the recall.
Note This metric creates four local variables, 'true_positives', 'true_negatives', 'false_positives', and 'false_negatives', that are used to compute the recall at the given precision. The threshold for the given precision value is computed and used to evaluate the corresponding recall. |
| RootMeanSquaredError |
Computes root mean squared error metric between 'y_true' and 'y_pred'.
|
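As a quick illustration of two of the metrics above, the sketch below evaluates MeanAbsoluteError and RootMeanSquaredError by hand on a tiny example using the standard Keras metric classes; the numbers are arbitrary.

```python
import tensorflow as tf

y_true = tf.constant([[1.0, 2.0, 3.0, 4.0]])
y_pred = tf.constant([[1.5, 2.0, 2.0, 4.0]])

mae = tf.keras.metrics.MeanAbsoluteError()
mae.update_state(y_true, y_pred)
print(float(mae.result()))   # (0.5 + 0.0 + 1.0 + 0.0) / 4 = 0.375

rmse = tf.keras.metrics.RootMeanSquaredError()
rmse.update_state(y_true, y_pred)
print(float(rmse.result()))  # sqrt((0.25 + 0.0 + 1.0 + 0.0) / 4) ~= 0.559
```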
The following metrics are available for measuring the performance of semantic segmentation models.
| Metric | Description |
|---|---|
| CategoricalAccuracy |
Calculates how often predictions match 'one-hot' labels. This metric creates two local variables, 'total' and 'count' that are used to compute the frequency with which 'y_pred' matches 'y_true'. This frequency is ultimately returned as 'categorical accuracy': an idempotent operation that simply divides 'total' by 'count'.
Note 'y_pred' and 'y_true' are passed in as vectors of probabilities, rather than as labels. |
| CategoricalCrossentropy |
Computes the crossentropy metric between the labels and predictions. Labels are given as a 'one_hot' representation. For example, when label values are [2, 0, 1], 'y_true' = [[0, 0, 1], [1, 0, 0], [0, 1, 0]].
|
| CategoricalHinge |
Computes the categorical hinge metric between 'y_true' and 'y_pred'.
|
| CosineSimilarity |
Computes the cosine similarity between the labels and predictions, in which: CosineSimilarity = (a · b) / (||a|| ||b||). |
| KLDivergence |
Computes the Kullback-Leibler divergence metric between 'y_true' and 'y_pred', in which: KLDivergence = 'y_true' * log('y_true' / 'y_pred'). |
| OrsDiceCoefficient | Similar to F1 score, evaluates the model's ability to maximize precision (how many positives are actually positive) and recall (how many positives were detected). |
| OrsJaccardSimilarityCoefficient | Similar to Intersection-Over-Union (IoU), evaluates the ratio of the intersection between two sets and the union of those two sets. |
| OrsTopKCategoricalAccuracy |
Computes how often the correct label is among the top K (2, 3, 4, or 5) labels predicted, ranked by predicted score (see the sketch that follows this table). |
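For illustration, the sketch below evaluates the standard Keras top-K categorical accuracy metric on a tiny two-sample, three-class example. It is assumed here that 'OrsTopKCategoricalAccuracy' behaves like this standard metric; the values are arbitrary.

```python
import tensorflow as tf

# One-hot labels and per-class probability predictions for two samples.
y_true = tf.constant([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y_pred = tf.constant([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

top2 = tf.keras.metrics.TopKCategoricalAccuracy(k=2)
top2.update_state(y_true, y_pred)
print(float(top2.result()))  # 1.0 -- both true classes are within the top 2
```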
Callbacks are functions called at particular points during the training process, usually at the end of a training epoch or at the end of batch processing. In the current version of the Deep Learning Tool, the following callbacks are supported to help the training process. These are available in the Callbacks box, as shown below.
Callbacks
| Callback | Description |
|---|---|
| Early Stopping |
Stops training upon a particular condition, for example if val_loss reaches a specific value, or if the results do not improve (see Early Stopping).
|
| Model Checkpoint | Saves the model during the training (see Model Checkpoint). |
| Reduce LR on Plateau | Reduces the learning rate (lr) when a selected metric has stopped improving (see Reduce LR on Plateau). |
| Terminate on NaN |
Terminates training when a NaN (Not a Number) loss is encountered. It is usually useful to select this callback so that training stops as soon as a problem is encountered.
Note Refer to www.tensorflow.org/api_docs/python/tf/keras/callbacks/TerminateOnNaN for more information about this callback. |
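The Deep Learning Tool configures these callbacks through its interface; in Keras terms, they are created and passed to the training call roughly as in the sketch below. The model and data names are hypothetical placeholders.

```python
from tensorflow import keras

# Callbacks are assembled into a list and passed to the training call.
callbacks = [
    keras.callbacks.TerminateOnNaN(),  # stop as soon as the loss becomes NaN
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=callbacks)  # hypothetical model and data
```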
The Early Stopping callback can be set to stop training when a monitored quantity has stopped improving, which can help prevent overfitting. When using early stopping, it is a good idea to choose a patience level that is consistent with the selected number of epochs (see the configuration sketch that follows the table below).
Early Stopping callback
| Parameter | Description |
|---|---|
| baseline | Is the baseline value for the monitored quantity to reach. Training will stop if the model doesn't show improvement over the baseline. |
| min_delta |
Is the minimum change in the monitored quantity that qualifies as an improvement. An absolute change of less than min_delta will count as no improvement.
|
| mode |
Determines when training will stop — Min, Max, or Auto.
Min… Training will stop when the monitored quantity has stopped decreasing, for example when monitoring a loss. Max… Training will stop when the monitored quantity has stopped increasing, for example when monitoring an accuracy. Auto… The direction is inferred automatically from the name of the monitored quantity. |
| monitor |
Lets you choose the quantity to be monitored, for example, val_loss.
The quantities that can be monitored differ for semantic segmentation and regression models. Note Statistics related to the monitored quantities appear on the progress bar during training and in the Training Results dialog. |
| patience | The number of epochs with no improvement after which training will be stopped. |
| restore_best_weights |
If True, the model weights from the epoch with the best value of the monitored quantity are restored at the end of training. If False, the model weights obtained at the last step of training are used.
|
| verbose | Lets you choose an option — 0 (silent) or 1 (verbose) — for producing detailed logging information. |
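In Keras terms, the parameters above map onto the EarlyStopping callback roughly as follows. The values are illustrative assumptions only; choose a patience consistent with your number of epochs.

```python
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # quantity to watch
    min_delta=1e-4,             # smaller changes do not count as improvement
    patience=10,                # epochs without improvement before stopping
    mode="auto",                # direction inferred from the monitored quantity
    baseline=None,              # optional value the model must improve on
    restore_best_weights=True,  # roll back to the best epoch's weights
    verbose=1,
)
```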
This callback can be configured to monitor a certain quantity during training and to save only the best model (see the configuration sketch that follows the table below).
Model Checkpoint callback
| Parameter | Description |
|---|---|
| load_weights_on_restart | True or False (the default setting is 'False').
If True, the model loads the weights saved in the checkpoint file when training is restarted. |
| mode | Min, Max, or Auto (the default setting is 'Auto').
Determines whether the current save file should be overwritten, based on either the minimization or maximization of the monitored quantity. In 'Auto' mode, the direction is inferred automatically from the name of the monitored quantity. |
| monitor |
Lets you choose the quantity to be monitored, for example, val_loss.
The quantities that can be monitored differ for semantic segmentation and regression models. Note Statistics related to the monitored quantities appear on the progress bar during training and in the Training Results dialog. |
| save_best_only | True or False (the default setting is 'False').
If True, the model is saved only when the monitored quantity improves on the best value seen so far. |
| save_freq |
Determines the frequency — epoch or an integer — in which the model is saved. The default setting is 'epoch'.
epoch… The callback saves the model after each epoch. Integer… The callback saves the model at the end of the batch at which this many samples have been seen since the last save. Note that if saving is not aligned to epochs, the monitored metric may be less reliable, as it could reflect as little as one batch, because the metrics are reset every epoch. |
| verbose | Lets you choose an option — 0 (silent) or 1 (verbose) — for producing detailed logging information. |
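In Keras terms, the parameters above map onto the ModelCheckpoint callback roughly as follows. The file path and values are illustrative assumptions; the Deep Learning Tool manages where models are saved.

```python
from tensorflow import keras

checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="best_model.h5",  # hypothetical output path
    monitor="val_loss",        # quantity to watch
    save_best_only=True,       # keep only the best model seen so far
    mode="auto",               # direction inferred from the monitored quantity
    save_freq="epoch",         # evaluate and save at the end of each epoch
    verbose=1,
)
```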
This callback can automatically reduce the learning rate of the selected optimization algorithm by a specified factor when the monitored quantity stops improving. This can be especially useful when the selected optimizer does not automatically adapt its learning rate. For example, SGD (Stochastic Gradient Descent) does not adapt automatically, but Adam does.
Reduce LR on Plateau callback
| Parameter | Description |
|---|---|
| cooldown | The number of epochs to wait before resuming normal operation after the learning rate has been reduced. |
| factor | The factor by which the learning rate will be reduced. Calculated as: new_lr = lr * factor. |
| min_delta | The threshold for measuring the new optimum, to only focus on significant changes. |
| min_lr | The lower bound on the learning rate. |
| monitor |
Lets you choose the quantity to be monitored, for example, val_loss.
The quantities that can be monitored differ for semantic segmentation and regression models. |
| patience | The number of epochs with no improvement after which the learning rate will be reduced. |
| verbose | Lets you choose an option — 0 (silent) or 1 (verbose) — for producing detailed logging information. |
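In Keras terms, the parameters above map onto the ReduceLROnPlateau callback roughly as follows, with illustrative values only.

```python
from tensorflow import keras

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # quantity to watch
    factor=0.5,          # new_lr = lr * factor
    patience=5,          # epochs without improvement before reducing
    min_delta=1e-4,      # threshold for counting a change as an improvement
    cooldown=2,          # epochs to wait before resuming monitoring
    min_lr=1e-6,         # never reduce below this learning rate
    verbose=1,
)
```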
